
    A Graphical Modelling Approach to the Dissection of Highly Correlated Transcription Factor Binding Site Profiles

    <div><p>Inferring the combinatorial regulatory code of transcription factors (TFs) from genome-wide TF binding profiles is challenging. A major reason is that TF binding profiles significantly overlap and are therefore highly correlated. Clustered occurrence of multiple TFs at genomic sites may arise from chromatin accessibility and local cooperation between TFs, or binding sites may simply appear clustered if the profiles are generated from diverse cell populations. Overlaps in TF binding profiles may also result from measurements taken at closely related time intervals. It is thus of great interest to distinguish TFs that <em>directly</em> regulate gene expression from those that are <em>indirectly</em> associated with gene expression. Graphical models, in particular Bayesian networks, provide a powerful mathematical framework to infer different types of dependencies. However, existing methods do not perform well when the features (here: TF binding profiles) are highly correlated, when their association with the biological outcome is weak, and when the sample size is small. Here, we develop a novel computational method, the Neighbourhood Consistent PC (NCPC) algorithms, which deal with these scenarios much more effectively than existing methods do. We further present a novel graphical representation, the Direct Dependence Graph (DDGraph), to better display the complex interactions among variables. NCPC and DDGraph can also be applied to other problems involving highly correlated biological features. Both methods are implemented in the R package <em>ddgraph</em>, available as part of Bioconductor (<a href="http://bioconductor.org/packages/2.11/bioc/html/ddgraph.html">http://bioconductor.org/packages/2.11/bioc/html/ddgraph.html</a>). Applied to real data, our method identified TFs that specify different classes of cis-regulatory modules (CRMs) in Drosophila mesoderm differentiation. 
Our analysis also found depletion of binding of the early transcription factor Twist at the CRMs regulating expression in visceral and somatic muscle cells at later stages, which suggests a CRM-specific repression mechanism that so far has not been characterised for this class of mesodermal CRMs.</p> </div>

    DDGraphs for the 5 CRM classes inferred by the NCPC algorithm at <i>α</i> = 0.05.

    <p>Variables in green circles are target variables. Variables in ovals are inferred causal neighbours. Variables in rectangles are inferred to have indirect dependence with the target. Values on the edges are (unadjusted) P-values from conditional independence tests. The same NCPC algorithm with no multiple testing correction was used as in the synthetic data benchmark. See <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1002725#pcbi-1002725-g001" target="_blank">Figure 1</a> for the graphical vocabulary.</p>

    Comparison of DDGraphs and DAGs.

    <p>(<b>A</b>) The causal neighbourhood of the target variable T consists of variables X1 and X2, while T's Markov blanket consists of X1, X2 and X4 (in ovals). The remaining variables X3 and X5 have indirect dependence (in rectangles). The DDGraph (left) and the DAG (right) represent the same conditional dependencies. The causal neighbourhood, the Markov blanket and the variables in indirect dependence are distinguishable by the variable shapes in the DDGraph, but have to be inferred in the DAG by following the edges. (<b>B</b>) Joint dependency patterns representable in the DDGraph (left) cannot be represented by DAGs (right). The DAG shown here represents the conditional independencies between X1 (or X2) and T given X2 (or X1), but it does not represent the marginal dependency between X1 (or X2) and T. Neither this DAG nor any other DAG can represent the entire joint dependency pattern.</p>
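To illustrate what "following the edges" in a DAG involves, here is a minimal Python sketch that derives the causal neighbourhood and the Markov blanket of a target from an edge list. The edge list is hypothetical: it was chosen only so that the resulting sets match those named in the caption for panel A, and it is not taken from the figure itself.

```python
# Hypothetical (parent, child) edges, assumed for illustration only.
EDGES = [("X1", "T"), ("T", "X2"), ("X4", "X2"), ("X3", "X1"), ("X5", "X4")]


def parents(edges, node):
    """Nodes with an edge pointing into `node`."""
    return {p for p, c in edges if c == node}


def children(edges, node):
    """Nodes that `node` points to."""
    return {c for p, c in edges if p == node}


def causal_neighbourhood(edges, node):
    """Direct neighbours in the DAG: parents and children."""
    return parents(edges, node) | children(edges, node)


def markov_blanket(edges, node):
    """Parents, children, and the other parents of the children (spouses)."""
    spouses = set()
    for child in children(edges, node):
        spouses |= parents(edges, child)
    spouses.discard(node)
    return causal_neighbourhood(edges, node) | spouses


neighbours_of_t = causal_neighbourhood(EDGES, "T")
blanket_of_t = markov_blanket(EDGES, "T")
all_nodes = {n for edge in EDGES for n in edge}
indirect = all_nodes - blanket_of_t - {"T"}
```

Under these assumed edges, X4 enters the Markov blanket only as a co-parent of X2, which is the kind of inference a reader has to make by eye in the DAG but can read directly from node shapes in the DDGraph.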

    Combinatorial patterns of TFs in inferred causal neighbourhoods.

    <p>For each combinatorial pattern we show the number of CRMs with this pattern in the CRM class and that in the rest of CRMs (percentages are given in parentheses). The difference in the two frequencies (CRM class vs rest) and the corresponding P-value are given in the last two columns. P-values were computed from Fisher's exact test for each combination and adjusted for multiple testing using the Benjamini-Hochberg method. See <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1002725#s4" target="_blank">Materials and Methods</a> for details. Frequency differences are colour-coded: blue for decrease in the CRM class, and orange for increase in the CRM class.</p>
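The testing procedure described in this caption (Fisher's exact test per pattern, then Benjamini-Hochberg adjustment) can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' implementation: the function names and the example counts are assumptions, and the two-sided P-value uses the common convention of summing all tables no more probable than the observed one.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def pmf(k):
        # Hypergeometric probability of k in the top-left cell, margins fixed.
        return comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)

    lowest = max(0, row1 - (n - col1))
    highest = min(row1, col1)
    p_observed = pmf(a)
    # Sum the probabilities of all tables no more likely than the observed one.
    total = sum(p for p in map(pmf, range(lowest, highest + 1))
                if p <= p_observed * (1 + 1e-9))
    return min(1.0, total)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts per pattern: (with pattern in class, without pattern in
# class, with pattern in rest, without pattern in rest).
tables = [(12, 28, 20, 250), (5, 35, 30, 240), (2, 38, 15, 255)]
raw_p = [fisher_exact_two_sided(*t) for t in tables]
adjusted_p = benjamini_hochberg(raw_p)
```

Each row of the table in the figure would correspond to one such 2x2 table, with the adjusted value reported in the last column.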

    Clustered pairwise correlation matrix of the 15 transcription factor binding profiles over all 310 CRMs.

    <p>Note that the cluster that consists of Mef2 8–12 h and Bin 6–12 h (lower left corner of the matrix) is anti-correlated with early Twi 2–4 h binding.</p>

    Proportion of correct predictions for the “Time” scenario.

    <p>Each cell shows the mean proportion of correct predictions (with 95% confidence intervals) averaged over 1000 data sets generated in each case. Highest prediction proportions accounting for variation in the data (pairwise t-tests with a cut-off of 0.001 for the P-values) are shown in bold. See <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1002725#s4" target="_blank">Materials and Methods</a> for the generation of the synthetic data and for the calculation of the correct prediction proportion.</p>

    Two scenarios for generating the synthetic data with correlated variables.

    <p>While the synthetic data were generated for a network of 15 explanatory variables, only variables X1 and X2 have direct dependence with the target variable T, and therefore constitute the causal neighbourhood of <i>T</i>. Variable X3 is included as the confounding variable. (<b>A</b>) The “Time” scenario in which X1, X2 and X3 correspond to three time points with stronger correlation between X1 and X2 and between X2 and X3 than between X1 and X3. (<b>B</b>) The “Hidden” scenario in which X1, X2 and X3 are correlated due to a common cause <i>H</i> in the network. This common cause is used in data generation, but is not available to algorithms.</p>
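The two scenarios can be mimicked with a small simulation. The sketch below is a simplification under stated assumptions, not the generator described in Materials and Methods: it uses binary variables, only X1, X2 and X3 rather than all 15 explanatory variables, and an arbitrary flip probability and target model (`flip_prob`, `sample_target`) that are hypothetical.

```python
import random


def noisy_copy(value, flip_prob, rng):
    """Return `value` with probability 1 - flip_prob, otherwise its complement."""
    return value ^ 1 if rng.random() < flip_prob else value


def sample_target(x1, x2, rng):
    """Hypothetical target model: T depends directly on X1 and X2 only."""
    return 1 if rng.random() < 0.2 + 0.3 * x1 + 0.3 * x2 else 0


def time_scenario(n, rng, flip_prob=0.2):
    """Chain X1 -> X2 -> X3, so adjacent time points correlate most strongly."""
    rows = []
    for _ in range(n):
        x1 = rng.randint(0, 1)
        x2 = noisy_copy(x1, flip_prob, rng)
        x3 = noisy_copy(x2, flip_prob, rng)
        rows.append((x1, x2, x3, sample_target(x1, x2, rng)))
    return rows


def hidden_scenario(n, rng, flip_prob=0.2):
    """Common cause H drives X1, X2 and X3; H itself is not returned."""
    rows = []
    for _ in range(n):
        h = rng.randint(0, 1)
        x1, x2, x3 = (noisy_copy(h, flip_prob, rng) for _ in range(3))
        rows.append((x1, x2, x3, sample_target(x1, x2, rng)))
    return rows


def correlation(rows, i, j):
    """Pearson correlation between columns i and j."""
    n = len(rows)
    mean_i = sum(r[i] for r in rows) / n
    mean_j = sum(r[j] for r in rows) / n
    cov = sum((r[i] - mean_i) * (r[j] - mean_j) for r in rows) / n
    var_i = sum((r[i] - mean_i) ** 2 for r in rows) / n
    var_j = sum((r[j] - mean_j) ** 2 for r in rows) / n
    return cov / (var_i * var_j) ** 0.5


rng = random.Random(0)
time_rows = time_scenario(20000, rng)
hidden_rows = hidden_scenario(20000, rng)
```

In both simulated scenarios X3 ends up correlated with T (column index 3) even though it never enters `sample_target`, which is the confounding that an algorithm has to separate from the direct effects of X1 and X2.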

    The graphical vocabulary of the DDGraph.

    <p>The vocabulary consists of five types of nodes and two types of edges. For the edges, directed edges ending with dots indicate conditional independences between <i>X<sub>k</sub></i> and the target variable <i>T</i> given <i>X<sub>i</sub></i>. Undirected edges indicate dependencies, which involve <i>T</i> in different ways, and are also used for conditional independencies between <i>X<sub>i</sub></i> and <i>X<sub>j</sub></i> given <i>T</i>. Consider a case of a non-faithful distribution where <i>T</i> is an XOR function of <i>X</i>1 and <i>X</i>2 with carefully set parameters so that from data it looks like <i>X</i>1 and <i>X</i>2 are marginally independent of <i>T</i>. In this case, <i>X</i>1 and <i>X</i>2 would each be conditionally dependent on <i>T</i> when conditioning on the other. This distribution would be represented as two dotted nodes with a dotted line between them, but disconnected from <i>T</i>. This kind of graph signals a non-faithful distribution where the neighbourhood and Markov blanket are not defined by traversing undirected edges from <i>T</i>.</p>
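The XOR case can be checked by direct enumeration. The following sketch assumes the simplest parameter choice, in which X1 and X2 are independent fair coins and T = X1 XOR X2; the helper `prob` is introduced here for illustration only.

```python
from itertools import product

NAMES = ("x1", "x2", "t")

# Joint distribution over (x1, x2, t) with X1, X2 independent fair coins.
joint = {(x1, x2, x1 ^ x2): 0.25 for x1, x2 in product((0, 1), repeat=2)}


def prob(**fixed):
    """Probability that the named variables take the given values."""
    return sum(p for outcome, p in joint.items()
               if all(outcome[NAMES.index(k)] == v for k, v in fixed.items()))


# Marginally, X1 carries no information about T.
p_t = prob(t=1)
p_t_given_x1 = [prob(t=1, x1=x) / prob(x1=x) for x in (0, 1)]

# Once X2 is also known, X1 determines T completely.
p_t_given_x1_x2 = {(a, b): prob(t=1, x1=a, x2=b) / prob(x1=a, x2=b)
                   for a, b in product((0, 1), repeat=2)}
```

Here `p_t_given_x1` equals `p_t` for both values of X1, while `p_t_given_x1_x2` takes only the values 0 and 1: marginal independence together with conditional dependence, the pattern that the dotted nodes are meant to flag.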

    Dependency of early cardiac enhancer activities on <i>tin</i>.

    <p>Shown are stage 11–12 embryos stained for enhancer activities (anti-βGal or anti-GFP) and Tin (green). (A–K) Enhancer activities in wild type backgrounds (left corner quadrants: anti-Tin omitted for better visualization of reporter patterns; arrow heads: early cardiac expression). (A′–K′) Enhancer activities in homozygous <i>tin</i><sup>346</sup> mutant backgrounds. (A, A′) <i>EgfrE1</i>-LacZ expression in cardiac mesoderm but not in somatic mesoderm (asterisks) requires <i>tin</i>. (B, B′) <i>fzL4-</i>GFP expression in cardioblast progenitors requires <i>tin</i>. (C, C′) High-level <i>HimL47</i>-GFP expression in cardiogenic mesoderm requires <i>tin</i> but somatic mesodermal expression does not. (D, D′) <i>lin-28L64</i>-GFP expression in cardiac mesoderm requires <i>tin</i>. Amnioserosa expression is unaffected in <i>tin</i> mutants. (E, E′) <i>midE19</i>-GFP expression in cardioblast progenitors requires <i>tin</i>. (F, F′) <i>RhoLE102</i>-GFP expression in cardiogenic mesoderm, but not in somatic mesoderm, requires <i>tin</i>. (G, G′) <i>tshL8</i>-LacZ expression in cardiogenic mesoderm, but not in somatic mesoderm, requires <i>tin</i>. (H, H′) <i>tupE9</i>-GFP expression in cardiogenic mesoderm requires <i>tin</i>. (I, I′) <i>unc-5L25</i>-GFP expression in cardiac mesoderm but not in somatic mesoderm requires <i>tin</i>. (J, J′) <i>CG3638L6</i>-GFP expression in cardioblast progenitors requires <i>tin</i>. (K, K′) <i>CG9973E15</i>-GFP expression in cardioblast progenitors requires <i>tin</i>.</p>

    Dependency of late cardiac enhancer activities within the dorsal vessel on <i>tin</i>.

    <p>Shown are reporter activities (anti-GFP, green), Tin<sup>+</sup> cardioblasts and pericardial cells (anti-Tin, red) and Doc<sup>+</sup> cardioblasts (anti-Doc, blue) in stage 15–16 control embryos (A–D) and in embryos specifically lacking Tin activity in cardiac cells (<i>tinABD</i>, <i>tin<sup>346</sup></i>; A′–D′). (A) <i>midE19</i>-GFP is expressed specifically in the Tin<sup>+</sup> cardioblasts. (A′) Absence of cardiac Tin expression causes a severe reduction of <i>midE19-</i>GFP activity. (B) <i>tupE9</i>-GFP is highly expressed in Tin<sup>+</sup> cardioblasts (graded posteriorly-to-anteriorly) and, at much lower levels perduring from stage 12 expression, is present in Doc<sup>+</sup> cardioblasts, pericardial cells, and dorsal somatic muscles. (B′) Upon loss of cardiac Tin expression almost all cardioblasts contain only low levels of perduring GFP. (C) <i>unc-5L25</i>-GFP expression in pericardial cells and (largely posteriorly) in cardioblasts. (C′) Absence of cardiac Tin expression causes near loss of cardioblast <i>unc-5L25</i>-GFP expression and a reduction of expression in pericardial cells. (D) <i>CG3638L6</i>-GFP is expressed in Tin<sup>+</sup> cardioblasts (with variable intensities) and in Tin<sup>+</sup> pericardial cells. (D′) Absence of cardiac Tin expression causes nearly complete loss of <i>CG3638L6</i>-GFP expression.</p>